@neoblizz commented Feb 5, 2026

Motivation

Let workers claim tile IDs dynamically at run time (work stealing) instead of assigning them through fixed partitioning.

Getting Started

git clone -b neoblizz/work-stealing https://github.com/ROCm/tritonBLAS
cd tritonBLAS
pip install -e .

# Install the latest Triton from source
git clone https://github.com/triton-lang/triton
cd triton
pip install -e .
cd ..  # return to the tritonBLAS root before running the benchmarks

# Work-stealing CU sweep (304 to 32 CUs)
python benchmarks/tritonblas_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --work-stealing \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --counters-per-xcd 1 \
    --output-csv results_ws_cu_sweep.csv

# torch.matmul baseline over the same CU sweep
python benchmarks/torch_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --output-csv results_torch_cu_sweep.csv

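# Plot the sweeps together. Note: results_persistent_sweep.csv is not produced
# by the commands above and must come from a separate static persistent sweep.
python tools/plot_cu_sweep.py \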
    --persistent results_persistent_sweep.csv \
    --torch      results_torch_cu_sweep.csv \
    --ws-cpc 1   results_ws_cu_sweep.csv \
    -o cu_sweep_plot.png

Copilot AI review requested due to automatic review settings February 5, 2026 20:34

Copilot AI left a comment

Pull request overview

This PR introduces a work-stealing persistent GEMM kernel that assigns tile IDs to compute units dynamically instead of using fixed partitioning. The implementation uses per-XCD (chiplet) atomic counters, which reduces contention relative to a single global atomic counter. The work-stealing kernel is an opt-in feature, enabled through a new work_stealing parameter in the matmul APIs.
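The dynamic assignment described above can be sketched on the CPU. This is a minimal illustration, not the kernel from this PR: the names AtomicCounter, worker, and run are hypothetical, a lock stands in for a GPU atomic add, and the mapping from a counter slot to a global tile ID is an assumption. The sketch also does not model claiming tiles from another XCD once the local range is exhausted.

```python
import threading

NUM_XCDS = 4        # assumed chiplet count for this sketch
TILES_PER_XCD = 8   # assumed number of tiles served by each XCD counter


class AtomicCounter:
    """Stand-in for a GPU atomic counter; a lock emulates the atomic add."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount=1):
        # Return the previous value and then increment, like an atomic add.
        with self._lock:
            old = self._value
            self._value += amount
            return old


def worker(xcd, counters, claimed):
    """Claim tiles from this XCD's counter until its range is used up."""
    while True:
        slot = counters[xcd].fetch_add(1)
        if slot >= TILES_PER_XCD:
            break  # no tiles left for this XCD
        # Assumed mapping: each XCD owns a contiguous range of tile IDs.
        claimed.append(xcd * TILES_PER_XCD + slot)


def run(workers_per_xcd=3):
    counters = [AtomicCounter() for _ in range(NUM_XCDS)]
    claimed = []  # list.append is thread-safe in CPython
    threads = [
        threading.Thread(target=worker, args=(xcd, counters, claimed))
        for xcd in range(NUM_XCDS)
        for _ in range(workers_per_xcd)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(claimed)


if __name__ == "__main__":
    tiles = run()
    print(len(tiles), "tiles claimed, each exactly once:",
          tiles == list(range(NUM_XCDS * TILES_PER_XCD)))
```

Workers on the same XCD contend only for that XCD's counter, and a worker that finishes a tile early simply claims the next slot, so no tile is processed twice and none is skipped.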

Changes:

  • Added MatmulConfig class to pre-allocate and manage GPU buffers for kernel launches (tile counters, stream-K locks/partials)
  • Implemented work-stealing kernel with per-XCD atomic tile counters in persistent_gemm_work_stealing.py
  • Extended all matmul APIs with optional work_stealing and config parameters to support the new kernel
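One reason to pre-allocate the tile counters is that they have to start from zero on every launch, so reusing and resetting one buffer avoids a fresh allocation per call. The sketch below illustrates that pattern only. LaunchBuffers and fake_launch are hypothetical names, a Python list stands in for a GPU buffer, and none of this reflects the actual MatmulConfig interface.

```python
class LaunchBuffers:
    """Hypothetical holder for reusable launch buffers (not MatmulConfig)."""

    def __init__(self, num_xcds, counters_per_xcd=1):
        # Allocate once up front; a real implementation would allocate on the GPU.
        self.tile_counters = [0] * (num_xcds * counters_per_xcd)

    def reset(self):
        # Zero in place so every launch reuses the same storage.
        for i in range(len(self.tile_counters)):
            self.tile_counters[i] = 0


def fake_launch(buffers, tiles_per_counter):
    """Pretend kernel: advance every counter as if all its tiles were claimed."""
    buffers.reset()
    for i in range(len(buffers.tile_counters)):
        buffers.tile_counters[i] += tiles_per_counter
    return list(buffers.tile_counters)


if __name__ == "__main__":
    buffers = LaunchBuffers(num_xcds=8, counters_per_xcd=1)
    first = fake_launch(buffers, tiles_per_counter=16)
    second = fake_launch(buffers, tiles_per_counter=16)
    print(first == second)  # both launches start from zeroed counters
```

Without the reset, the second launch would begin with exhausted counters and claim no tiles.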

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 22 comments.

| File | Description |
| --- | --- |
| include/tritonblas/matmul.py | Added MatmulConfig class for buffer management; integrated work_stealing parameter and ws_persistent_matmul kernel; refactored buffer allocation to use config objects |
| include/tritonblas/kernels/persistent_gemm_work_stealing.py | New work-stealing kernel implementation with per-XCD atomic counters and dynamic tile assignment |
| include/tritonblas/kernels/__init__.py | Exported ws_persistent_matmul kernel |
| include/tritonblas/__init__.py | Exported MatmulConfig and matmul_preamble to public API |
| tests/test_work_stealing.py | Standalone test with custom module loading to test work-stealing kernel correctness and performance |
| benchmarks/benchmark_work_stealing.py | Comprehensive benchmark comparing work-stealing against static persistent, stream-K, and torch.matmul |

@ryanswann-amd self-requested a review February 12, 2026 17:29